Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix KeyError raised by add_files when parquet file doe not have column stats #1354

Merged
merged 5 commits into from
Nov 25, 2024

Conversation

binayakd
Copy link
Contributor

Resolves #1353, by switching del with pop to prevent KeyError.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to test this?

I think this change is good since there is a fair chance that the fields are not in the map. Keep in mind, when you add Parquet files that don't have any stats, they will be included in every table scan as PyIceberg has no information to decide if the file is relevant for the query.

@Fokko Fokko added this to the PyIceberg 0.8.1 release milestone Nov 21, 2024
@binayakd
Copy link
Contributor Author

Thanks @Fokko! Yes that did come to mind, I was also thinking of its possible to create the stats on the fly, but though it might be left as an enhancement.

Ok let me try and add some unit tests.

@binayakd
Copy link
Contributor Author

@Fokko I have added the unit test, hopefully its up to make.

@Fokko
Copy link
Contributor

Fokko commented Nov 21, 2024

@binayakd Thanks for adding these tests 👍 Can you run make lint to fix the formatting issues?

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a few nit comments, thanks for the contribution!

tests/io/test_pyarrow_stats.py Outdated Show resolved Hide resolved
tests/io/test_pyarrow_stats.py Outdated Show resolved Hide resolved
tests/io/test_pyarrow_stats.py Outdated Show resolved Hide resolved
@binayakd
Copy link
Contributor Author

@kevinjqliu, @Fokko, pushed the linting fix, and also updated the test based on the suggestions. Thank you!

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally LGTM, have a few nit comments

tests/io/test_pyarrow_stats.py Outdated Show resolved Hide resolved
tests/io/test_pyarrow_stats.py Outdated Show resolved Hide resolved
tests/io/test_pyarrow_stats.py Outdated Show resolved Hide resolved
@binayakd
Copy link
Contributor Author

Pushed a change to fix the python 3.9 compatibility and updated the test based on the comment, @kevinjqliu. Thanks!

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last nit comment about test readability :)

tests/io/test_pyarrow_stats.py Outdated Show resolved Hide resolved
tests/io/test_pyarrow_stats.py Outdated Show resolved Hide resolved
tests/io/test_pyarrow_stats.py Outdated Show resolved Hide resolved
tests/io/test_pyarrow_stats.py Outdated Show resolved Hide resolved
tests/io/test_pyarrow_stats.py Outdated Show resolved Hide resolved
@binayakd
Copy link
Contributor Author

@kevinjqliu pushed the test readability fix. Thanks!

@Fokko Fokko requested a review from kevinjqliu November 25, 2024 14:04
@Fokko
Copy link
Contributor

Fokko commented Nov 25, 2024

Gentle ping @kevinjqliu, so we can wrap the 0.8.1 release up :)

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the contribution :)

@kevinjqliu kevinjqliu merged commit ab43c6c into apache:main Nov 25, 2024
8 checks passed
kevinjqliu pushed a commit to kevinjqliu/iceberg-python that referenced this pull request Nov 25, 2024
…olumn stats (apache#1354)

* fix KeyError, by switching del to pop

* added unit test

* update test

* fix python 3.9 compatibility, and refactor test

* update test
kevinjqliu added a commit that referenced this pull request Nov 27, 2024
* use the non-deprecated func (#1326)

* 0.8.0 post release steps (#1334)

* add

* fix mkdoc

* Drop upper bounds for fsspec and it's implementations (#1341)

* Drop upper bounds for fsspec and it's implementations

* Run poetry lock

* Ignore tables without `table_type` from Glue and Hive

* Ignore tables without table_type parameters while loading all iceberg table from Glue and Hive catalog (#1331)

* Use TABLE_TYPE

---------

Co-authored-by: Wenzhuo Zhao <zhaowenzhuo01@bilibili.com>

* Replace reference of `Table.identifier` with `Table.name` (#1346)

* fix Table.name

* replace Table.identifier with Table.name

* add warning filter

* Allow leading underscore in column name used in row filter (#1358)

* Update parser.py

Allow leading underscore in column name used in row filter.

* Update test_parser.py

* Update test_parser.py

* Update test_parser.py

* Remove Python 3.13 upper bound restriction (#1355)

* Remove Python 3.13 upper bound restriction

* Fix missing poetry.lock file

* Upgrading numpy on the poetry.lock file from v1.26.0 to v1.26.4

* Improve documentation for "how to release" (#1359)

* initial update

* edits

* add gpg instructions

* verify artifacts

* add twine not

* grammar

* edits

* remove old artifacts

* update doc workflow action

* and name

* add docs on patch vs major/minor release

* fix `KeyError` raised by `add_files` when parquet file doe not have column stats (#1354)

* fix KeyError, by switching del to pop

* added unit test

* update test

* fix python 3.9 compatibility, and refactor test

* update test

* bump to 0.8.1

* Add instruction for patch release (#1373)

* add instruction for patch release

* create branch from tag

* Write `null` when there is no parent-snapshot-id (#1383)

---------

Co-authored-by: Sumanth <33193748+sumanth-manchala@users.noreply.github.com>
Co-authored-by: gitzwz <72312233+gitzwz@users.noreply.github.com>
Co-authored-by: Wenzhuo Zhao <zhaowenzhuo01@bilibili.com>
Co-authored-by: vincenzon <mvcalder@gmail.com>
Co-authored-by: Luca Bigon <luca.bigon@bauplanlabs.com>
Co-authored-by: Binayak Dasgupta <binayakd86@gmail.com>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
kevinjqliu pushed a commit to kevinjqliu/iceberg-python that referenced this pull request Nov 27, 2024
…olumn stats (apache#1354)

* fix KeyError, by switching del to pop

* added unit test

* update test

* fix python 3.9 compatibility, and refactor test

* update test
kevinjqliu pushed a commit that referenced this pull request Nov 27, 2024
…olumn stats (#1354)

* fix KeyError, by switching del to pop

* added unit test

* update test

* fix python 3.9 compatibility, and refactor test

* update test
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
…olumn stats (apache#1354)

* fix KeyError, by switching del to pop

* added unit test

* update test

* fix python 3.9 compatibility, and refactor test

* update test
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
…olumn stats (apache#1354)

* fix KeyError, by switching del to pop

* added unit test

* update test

* fix python 3.9 compatibility, and refactor test

* update test
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 24, 2024
…olumn stats (apache#1354)

* fix KeyError, by switching del to pop

* added unit test

* update test

* fix python 3.9 compatibility, and refactor test

* update test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add_files raises KeyError if parquet file doe not have column stats
3 participants